Food Delivery Time Prediction - Analysis¶
Given the number of methods and models, the code may take a few minutes to execute.
If you want to view the results without running it, go to the file Food_delivery_analisis.html.
Setup¶
As a first step, we import the necessary libraries and tools.
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import scipy.stats as stats
from statsmodels.stats.outliers_influence import variance_inflation_factor as VIF
from statsmodels.stats.anova import anova_lm
from scipy.stats import chi2_contingency
from scipy.stats import boxcox
from ISLP import load_data
from ISLP.models import (ModelSpec as MS, summarize, poly)
from functools import partial
from sklearn.model_selection import (cross_validate, KFold, ShuffleSplit, train_test_split, cross_val_score, GridSearchCV)
from sklearn.base import clone
from ISLP.models import sklearn_sm
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, Lasso
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import confusion_matrix, mean_absolute_error, r2_score
from IPython.display import HTML
The dataset Food Delivery Times is imported from the Git directory.
After examining the contents of the database, we observe the presence of null values. It has been decided to remove these rows.
Null values are removed because they can interfere with the calculations of the model, leading to inaccurate results.
Food = pd.read_csv("dataset/Food_Delivery_Times.csv")
print("Number of rows in the original dataset:", len(Food))
null_rows = Food.isnull().any(axis=1).sum()
print("Number of rows with missing values:", null_rows)
Food = Food.dropna()
print("Number of rows after removing missing values:", len(Food))
Food.head()
Number of rows in the original dataset: 1000 Number of rows with missing values: 117 Number of rows after removing missing values: 883
| Order_ID | Distance_km | Weather | Traffic_Level | Time_of_Day | Vehicle_Type | Preparation_Time_min | Courier_Experience_yrs | Delivery_Time_min | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 522 | 7.93 | Windy | Low | Afternoon | Scooter | 12 | 1.0 | 43 |
| 1 | 738 | 16.42 | Clear | Medium | Evening | Bike | 20 | 2.0 | 84 |
| 2 | 741 | 9.52 | Foggy | Low | Night | Scooter | 28 | 1.0 | 59 |
| 3 | 661 | 7.44 | Rainy | Medium | Afternoon | Scooter | 5 | 1.0 | 37 |
| 4 | 412 | 19.03 | Clear | Low | Morning | Bike | 16 | 5.0 | 68 |
variables = {
'Order_ID': 'Order ID',
'Distance_km': 'Distance (km)',
'Weather': 'Weather',
'Traffic_Level': 'Traffic Level',
'Time_of_Day': 'Time of Day',
'Vehicle_Type': 'Vehicle Type',
'Preparation_Time_min': 'Preparation Time (min)',
'Courier_Experience_yrs': 'Courier Experience (yrs)',
'Delivery_Time_min': 'Delivery Time (min)',
'Courier_Experience': 'Courier Experience'
}
Outliers are extreme values that can distort analysis and lead to misleading results.
We used the Interquartile Range (IQR) method to detect and remove them.
This ensures cleaner data and more reliable outcomes in the following analyses.
Food_original = Food.copy()
xlims = []
plt.figure(figsize=(20, 14))
font_size = 16
tick_size = 14
numeric_columns = Food.select_dtypes(include=['number']).columns.tolist()
for i, col in enumerate(numeric_columns, 1):
plt.subplot(2, len(numeric_columns), i)
ax = sns.boxplot(x=Food_original[col])
plt.xlabel(f"{variables.get(col, col)} - Before", fontsize=font_size)
plt.tick_params(axis='both', labelsize=tick_size, color='black', width=0.5)
xlims.append(ax.get_xlim())
def remove_outliers(df, column):
Q1 = df[column].quantile(0.25)
Q3 = df[column].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]
for col in numeric_columns:
Food = remove_outliers(Food, col)
for i, col in enumerate(numeric_columns, 1):
plt.subplot(2, len(numeric_columns), len(numeric_columns) + i)
ax = sns.boxplot(x=Food[col])
plt.xlim(xlims[i-1])
plt.xlabel(f"{variables.get(col, col)} - After", fontsize=font_size)
plt.tick_params(axis='both', labelsize=tick_size, color='black', width=0.5)
for ax in plt.gcf().get_axes():
for spine in ax.spines.values():
spine.set_visible(True)
spine.set_color('black')
spine.set_linewidth(0.5)
plt.subplots_adjust(left=0.05, right=0.95, top=0.95, bottom=0.05, wspace=0.4, hspace=0.4)
plt.tight_layout()
plt.show()
print("Number of rows after removing outliers", len(Food))
Number of rows after removing outliers 879
The boxplots show the distribution of numerical variables before and after outlier removal.
The dataset shows good quality as it contains few outliers.
The cleaned data is now more consistent and less influenced by rare or abnormal values.
We now present the distribution of each variable to understand how frequently different values occur, both for numerical features and categorical features
title_fontsize = 20
label_fontsize = 15
tick_fontsize = 13
numerical_cols = ['Distance_km', 'Preparation_Time_min', 'Courier_Experience_yrs', 'Delivery_Time_min']
for col in numerical_cols:
plt.figure(figsize=(8, 6))
plt.hist(Food[col], bins=5, color='skyblue', edgecolor='black')
plt.title(f'Distribution of {variables[col]}', fontsize=title_fontsize)
plt.xlabel(variables[col], fontsize=label_fontsize)
plt.ylabel('Frequency', fontsize=label_fontsize)
plt.xticks(fontsize=tick_fontsize)
plt.yticks(fontsize=tick_fontsize)
plt.tight_layout()
plt.show()
categorical_cols = ['Weather', 'Traffic_Level', 'Time_of_Day', 'Vehicle_Type']
for col in categorical_cols:
plt.figure(figsize=(8, 6))
Food[col].value_counts().plot(kind='bar', color='lightgreen', edgecolor='black')
plt.title(f'Distribution of {variables[col]}', fontsize=title_fontsize)
plt.xlabel(variables[col], fontsize=label_fontsize)
plt.ylabel('Frequency', fontsize=label_fontsize)
plt.xticks(fontsize=tick_fontsize, rotation=0)
plt.yticks(fontsize=tick_fontsize)
plt.tight_layout()
plt.show()
Simple Linear Regression¶
Before building regression models, we assess the quality of the dataset by analyzing the correlation between variables through the construction of a correlation matrix.
numeric_df = Food.select_dtypes(include=[np.number])
if numeric_df.shape[1] >= 4:
plt.figure(figsize=(20, 18))
corr = numeric_df.corr()
corr.index = [variables.get(col, col) for col in corr.index]
corr.columns = [variables.get(col, col) for col in corr.columns]
ax = sns.heatmap(
corr,
annot=True,
fmt='.2f',
cmap='coolwarm',
annot_kws={"size": 25},
cbar_kws={"shrink": 0.8},
linewidths=0.7,
square=True,
linecolor='gray'
)
ax.set_xticklabels(ax.get_xticklabels(), fontsize=24, rotation=45, ha='right')
ax.set_yticklabels(ax.get_yticklabels(), fontsize=24, rotation=0)
ax.set_title("Correlation Heatmap of Numeric Features", fontsize=30, pad=30)
cbar = ax.collections[0].colorbar
cbar.ax.tick_params(labelsize=20)
plt.tight_layout()
plt.show()
In our case, some variables show a clear linear relationship with delivery time, indicating their potential importance in the regression model.
Additionally, the predictors are not highly correlated with each other, suggesting that the dataset is suitable for linear regression and that multicollinearity is not a concern.
The two numerical variables Distance (km) and Preparation Time (min) are selected, and the relationship between each of them and the response variable Delivery Time (min) is analyzed individually.
The variable Order ID is completely irrelevant to our objective, so we do not take it into account.
Courier Experience (yrs) is numeric from 0 to 9, but we will consider it as categorical because the values represent discrete levels of experience and their relationship with the target variable may not be linear.
X = pd.DataFrame({
'intercept': np.ones(Food.shape[0]),
variables['Distance_km']: Food['Distance_km']
})
X[:4]
| intercept | Distance (km) | |
|---|---|---|
| 0 | 1.0 | 7.93 |
| 1 | 1.0 | 16.42 |
| 2 | 1.0 | 9.52 |
| 3 | 1.0 | 7.44 |
def summarize(results):
table = results.summary2().tables[1]
table = table.drop(columns=['[0.025', '0.975]'], errors='ignore')
rows = []
for idx, row in table.iterrows():
coef = row['Coef.']
pval = row['P>|t|']
opacity = "opacity: 0.35;" if pval >= 0.05 else ""
coef_color = 'green' if coef >= 0 else 'red'
coef_html = f'<td style="color: {coef_color}; font-weight: bold;">{coef:.3f}</td>'
pval_html = f'<td style="font-weight: bold;">{pval:.3f}</td>'
html_row = (
f'<tr style="{opacity}">'
f'<th style="font-weight: bold;">{idx}</th>{coef_html}'
f'<td>{row["Std.Err."]:.3f}</td>'
f'<td>{row["t"]:.3f}</td>{pval_html}</tr>'
)
rows.append(html_row)
header = (
"<tr><th></th><th>Coef.</th><th>Std.Err.</th><th>t</th><th>P>|t|</th></tr>"
)
html_table = (
f"<table border='1' class='dataframe'>"
f"<thead>{header}</thead><tbody>{''.join(rows)}</tbody></table>"
)
return HTML(html_table)
y = Food['Delivery_Time_min']
model = sm.OLS(y, X)
results = model.fit()
summarize(results)
| Coef. | Std.Err. | t | P>|t| | |
|---|---|---|---|---|
| intercept | 26.867 | 0.890 | 30.184 | 0.000 |
| Distance (km) | 2.918 | 0.077 | 37.746 | 0.000 |
Distance has a p-value of 0.000, indicating that it is highly significant.
The coefficient is positive, meaning that as the distance increases, the waiting time also increases.
The intercept is high, suggesting that distance alone does not sufficiently explain the delivery times.
Model prediction and confidence intervals¶
We define a linear regression design matrix including an intercept and the explanatory variable Distance (km) using the ModelSpec class from ISLP.
This allows for consistent and automatic construction of the design matrix for both the original dataset and new observations.
We then generate predictions and 95% confidence intervals for new values of the distance based on the previously fitted model.
model = MS(['Distance_km'])
model = model.fit(Food)
X = model.transform(Food)
X.columns = [variables.get(col, col) for col in X.columns]
X[:4]
| intercept | Distance (km) | |
|---|---|---|
| 0 | 1.0 | 7.93 |
| 1 | 1.0 | 16.42 |
| 2 | 1.0 | 9.52 |
| 3 | 1.0 | 7.44 |
new_df = pd.DataFrame({'Distance_km':[10, 12, 14, 16, 20]})
newX = model.transform(new_df)
newX.columns = [variables.get(col, col) for col in X.columns]
newX
| intercept | Distance (km) | |
|---|---|---|
| 0 | 1.0 | 10 |
| 1 | 1.0 | 12 |
| 2 | 1.0 | 14 |
| 3 | 1.0 | 16 |
| 4 | 1.0 | 20 |
new_predictions = results.get_prediction(newX)
new_predictions.predicted_mean
array([56.047364 , 61.88336722, 67.71937045, 73.55537368, 85.22738013])
new_predictions.conf_int(alpha=0.05)
array([[55.18622211, 56.90850588],
[60.97123424, 62.79550021],
[66.66742956, 68.77131134],
[72.30423639, 74.80651096],
[83.48514315, 86.96961711]])
def abline(ax, b, m, *args, **kwargs):
xlim = ax.get_xlim()
ylim = [m * xlim[0] + b, m * xlim[1] + b]
ax.plot(xlim, ylim, *args, **kwargs)
ax = Food.plot.scatter('Distance_km', 'Delivery_Time_min', s=15)
abline(ax, results.params.iloc[0], results.params.iloc[1], 'r--', linewidth=2)
ax.set_xlabel(variables.get('Distance_km', 'Distance_km'), labelpad=15, fontsize=15)
ax.set_ylabel(variables.get('Delivery_Time_min', 'Delivery_Time_min'), labelpad=15, fontsize=15)
ax.tick_params(axis='both', which='both', color='black', width=0.5, labelsize=13)
for spine in ax.spines.values():
spine.set_visible(True)
spine.set_color('black')
spine.set_linewidth(0.5)
ax.get_figure().canvas.draw()
The scatter plot shows a positive correlation between distance in kilometers and delivery time in minutes.
The data points align well with the linear trend line, indicating that the model is accurate and useful for predicting delivery times based on distance.
Fitted values vs residuals¶
fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(results.fittedvalues, results.resid, s=15)
ax.set_xlabel('Fitted value', fontsize=15, labelpad=15)
ax.set_ylabel('Residual', fontsize=15, labelpad=15)
ax.axhline(0, c='k', ls='--')
ax.tick_params(axis='both', which='both', color='black', width=0.5, labelsize=13)
for spine in ax.spines.values():
spine.set_visible(True)
spine.set_color('black')
spine.set_linewidth(0.5)
fig.tight_layout()
The residuals are spread around the zero line, indicating that the model fits the data well.
There are no clear patterns in the residuals, suggesting the model's assumptions about linearity are valid.
Leverage¶
infl = results.get_influence()
fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(np.arange(X.shape[0]), infl.hat_matrix_diag, s=15)
ax.set_xlabel('Index', fontsize=15, labelpad=15)
ax.set_ylabel('Leverage', fontsize=15, labelpad=15)
ax.tick_params(axis='both', which='both', color='black', width=0.5, labelsize=13)
for spine in ax.spines.values():
spine.set_visible(True)
spine.set_color('black')
spine.set_linewidth(0.5)
fig.tight_layout()
_ = np.argmax(infl.hat_matrix_diag)
The leverage values are low, meaning no single data point has too much impact on the model.
The leverage level for outliers is 0.0045. Since almost all data points are below this threshold, the model is well-balanced.
X = pd.DataFrame({
'intercept': np.ones(Food.shape[0]),
variables['Preparation_Time_min']: Food['Preparation_Time_min']
})
X[:4]
| intercept | Preparation Time (min) | |
|---|---|---|
| 0 | 1.0 | 12 |
| 1 | 1.0 | 20 |
| 2 | 1.0 | 28 |
| 3 | 1.0 | 5 |
y = Food['Delivery_Time_min']
model = sm.OLS(y, X)
results = model.fit()
summarize(results)
| Coef. | Std.Err. | t | P>|t| | |
|---|---|---|---|---|
| intercept | 41.295 | 1.727 | 23.913 | 0.000 |
| Preparation Time (min) | 0.870 | 0.093 | 9.322 | 0.000 |
Preparation Time (min) has a p-value of 0.000, indicating that it is highly significant.
The coefficient is positive, meaning that as the preparation time increases, the waiting time also increases.
The intercept is relatively high, suggesting that preparation time alone does not fully explain the total waiting time.
Model prediction and confidence intervals¶
model = MS(['Preparation_Time_min'])
model = model.fit(Food)
X = model.transform(Food)
X.columns = [variables.get(col, col) for col in X.columns]
X[:4]
| intercept | Preparation Time (min) | |
|---|---|---|
| 0 | 1.0 | 12 |
| 1 | 1.0 | 20 |
| 2 | 1.0 | 28 |
| 3 | 1.0 | 5 |
new_df = pd.DataFrame({'Preparation_Time_min':[10, 12, 14, 16, 20]})
newX = model.transform(new_df)
newX.columns = [variables.get(col, col) for col in X.columns]
newX
| intercept | Preparation Time (min) | |
|---|---|---|
| 0 | 1.0 | 10 |
| 1 | 1.0 | 12 |
| 2 | 1.0 | 14 |
| 3 | 1.0 | 16 |
| 4 | 1.0 | 20 |
new_predictions = results.get_prediction(newX)
new_predictions.predicted_mean
array([49.99800226, 51.73863378, 53.47926529, 55.21989681, 58.70115983])
new_predictions.conf_int(alpha=0.05)
array([[48.14830241, 51.84770211],
[50.12183792, 53.35542964],
[52.03868239, 54.91984819],
[53.87634173, 56.56345188],
[57.26216877, 60.1401509 ]])
def abline(ax, b, m, *args, **kwargs):
xlim = ax.get_xlim()
ylim = [m * xlim[0] + b, m * xlim[1] + b]
ax.plot(xlim, ylim, *args, **kwargs)
ax = Food.plot.scatter('Preparation_Time_min', 'Delivery_Time_min', s=15)
abline(ax, results.params.iloc[0], results.params.iloc[1], 'r--', linewidth=2)
ax.set_xlabel(variables.get('Preparation_Time_min', 'Preparation_Time_min'), fontsize=15, labelpad=15)
ax.set_ylabel(variables.get('Delivery_Time_min', 'Delivery_Time_min'), fontsize=15, labelpad=15)
ax.tick_params(axis='both', which='both', color='black', width=0.5, labelsize=13)
for spine in ax.spines.values():
spine.set_visible(True)
spine.set_color('black')
spine.set_linewidth(0.5)
ax.get_figure().canvas.draw()
The scatter plot shows a positive correlation between preparation time and delivery time in minutes.
The data points generally align with the linear trend line, suggesting that the model is reasonably good.
However, compared to the previous model, the dispersion of the data points appears to increase.
Additionally, there is noticeable vertical overlap between points at similar time values, indicating that other factors beyond distance may also influence delivery times.
Fitted values vs residuals¶
fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(results.fittedvalues, results.resid, s=15)
ax.set_xlabel('Fitted value', fontsize=15, labelpad=15)
ax.set_ylabel('Residual', fontsize=15, labelpad=15)
ax.axhline(0, c='k', ls='--')
ax.tick_params(axis='both', which='both', color='black', width=0.5, labelsize=13)
for spine in ax.spines.values():
spine.set_visible(True)
spine.set_color('black')
spine.set_linewidth(0.5)
fig.tight_layout()
The residuals appear to be fairly symmetrically distributed above and below the zero line, which supports the assumption of linearity in the model.
However, the residuals are widely spread, showing a noticeable vertical dispersion.
Leverage¶
infl = results.get_influence()
fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(np.arange(X.shape[0]), infl.hat_matrix_diag, s=15)
ax.set_xlabel('Index', fontsize=15, labelpad=15)
ax.set_ylabel('Leverage', fontsize=15, labelpad=15)
ax.tick_params(axis='both', which='both', color='black', width=0.5, labelsize=13)
for spine in ax.spines.values():
spine.set_visible(True)
spine.set_color('black')
spine.set_linewidth(0.5)
fig.tight_layout()
np.argmax(infl.hat_matrix_diag)
_ = np.argmax(infl.hat_matrix_diag)
Leverage values across the dataset are low, indicating that no individual observation exerts excessive influence on the model's predictions.
The threshold for high leverage is 0.0045, and all points fall below this value.
From this point of view, the model is very good.
X = MS(['Distance_km', 'Preparation_Time_min']).fit_transform(Food)
X.columns = [variables.get(col, col) for col in X.columns]
y = Food['Delivery_Time_min']
model = sm.OLS(y, X)
results = model.fit()
summarize(results)
| Coef. | Std.Err. | t | P>|t| | |
|---|---|---|---|---|
| intercept | 10.670 | 1.175 | 9.082 | 0.000 |
| Distance (km) | 2.950 | 0.066 | 44.680 | 0.000 |
| Preparation Time (min) | 0.933 | 0.052 | 18.079 | 0.000 |
The results of the linear regression model indicate that distance and preparation time significantly influence food delivery time (p-values = 0.000).
Specifically, delivery time increases by approximately 3.00 minutes per additional kilometer.
It also increases by 1 minutes for each additional minute of preparation.
R2 = results.rsquared
RSE = np.sqrt(results.scale)
print("R2:", round(R2, 3))
print("RSE:", round(RSE, 3))
R2: 0.723 RSE: 11.107
The linear regression model explains approximately 72% of the variance in food delivery time, indicating a strong fit.
The residual standard error (RSE) is about 11 minutes, suggesting a moderate prediction error given the likely range of delivery times.
Overall, the model performs well.
Qualitative predictors¶
Now we examine the remaining variables, which are categorical.
Before starting, we adjust the variable Courier Experience (yrs) by converting it into a categorical one, as previously mentioned during the analysis of correlations between numerical variables.
To do this, we create a new column Courier Experience with the following values:
- Low for experience from 0 to 2 years,
- Medium for experience from 3 to 5 years,
- High for experience greater than 5 years.
def categorize_experience(x):
if x <= 2:
return 'Low'
elif x <= 5:
return 'Medium'
else:
return 'High'
Food['Courier_Experience'] = Food['Courier_Experience_yrs'].apply(categorize_experience)
nan_before = Food['Courier_Experience_yrs'].isna().sum()
nan_after = Food['Courier_Experience'].isna().sum()
print(f"NaN prima della trasformazione: {nan_before}")
print(f"NaN dopo la trasformazione: {nan_after}")
NaN prima della trasformazione: 0 NaN dopo la trasformazione: 0
for col in Food.columns:
if Food[col].dtype == 'object':
col_name = variables.get(col, col)
print(f"{col_name}: {Food[col].unique()}")
Weather: ['Windy' 'Clear' 'Foggy' 'Rainy' 'Snowy'] Traffic Level: ['Low' 'Medium' 'High'] Time of Day: ['Afternoon' 'Evening' 'Night' 'Morning'] Vehicle Type: ['Scooter' 'Bike' 'Car'] Courier Experience: ['Low' 'Medium' 'High']
Before creating the new model, we verify that the categorical variables are not strongly correlated with each other, ensuring that each one contributes useful information to the model.
Since the variables are not numerical, we cannot use a standard correlation matrix.
Instead, we use Cramér's V, a statistical measure of association between two categorical variables, which ranges from 0 (no association) to 1 (perfect association).
categorical = ['Weather', 'Traffic_Level', 'Time_of_Day', 'Vehicle_Type', 'Courier_Experience']
Food[categorical] = Food[categorical].astype('category')
def cramers_v(x, y):
confusion_matrix = pd.crosstab(x, y)
chi2, _, _, _ = chi2_contingency(confusion_matrix)
n = confusion_matrix.sum().sum()
phi2 = chi2 / n
r, k = confusion_matrix.shape
phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
rcorr = r - ((r-1)**2)/(n-1)
kcorr = k - ((k-1)**2)/(n-1)
return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))
cramer_matrix = pd.DataFrame(
np.zeros((len(categorical), len(categorical))),
index=categorical,
columns=categorical
)
for var1 in categorical:
for var2 in categorical:
cramer_matrix.loc[var1, var2] = cramers_v(Food[var1], Food[var2])
cramer_matrix.columns = [variables.get(col, col) for col in cramer_matrix.columns]
cramer_matrix.index = [variables.get(row, row) for row in cramer_matrix.index]
plt.figure(figsize=(20, 18))
ax = sns.heatmap(
cramer_matrix,
annot=True,
fmt=".2f",
cmap='coolwarm',
square=True,
linewidths=0.7,
linecolor='gray',
cbar_kws={'shrink': 0.8},
annot_kws={"size": 20}
)
ax.set_xticklabels(ax.get_xticklabels(), fontsize=24, rotation=45, ha='right')
ax.set_yticklabels(ax.get_yticklabels(), fontsize=24, rotation=0)
ax.set_title("Cramér's V of Categorical Features", fontsize=30, pad=30)
cbar = ax.collections[0].colorbar
cbar.ax.tick_params(labelsize=20)
plt.tight_layout()
plt.show()
The Cramér’s V matrix shows very low values, indicating that the categorical variables are essentially independent from each other.
No significant associations emerge, suggesting that each variable provides distinct and non-redundant information to the model.
Using the tools from ISLP, we build a regression model that includes all variables, both numerical and categorical.
Categorical variables are automatically handled by the library through one-hot encoding, a process that transforms each category into a separate binary variable.
This allows the model to interpret categorical information in a numerical form suitable for linear regression.
model = MS([
'Distance_km',
'Weather',
'Traffic_Level',
'Time_of_Day',
'Vehicle_Type',
'Preparation_Time_min',
'Courier_Experience'
])
X = model.fit_transform(Food)
X.columns = [variables.get(col, col) for col in X.columns]
for i, col in enumerate(X.columns):
if '[' in col and ']' in col:
base_col = col.split('[')[0]
if base_col in variables:
X.columns.values[i] = col.replace(base_col, variables[base_col])
y = Food['Delivery_Time_min']
model = sm.OLS(y, X)
results = model.fit()
summarize(results)
| Coef. | Std.Err. | t | P>|t| | |
|---|---|---|---|---|
| intercept | 14.026 | 1.397 | 10.037 | 0.000 |
| Distance (km) | 2.912 | 0.056 | 51.796 | 0.000 |
| Weather[Foggy] | 8.116 | 1.057 | 7.676 | 0.000 |
| Weather[Rainy] | 4.996 | 0.837 | 5.967 | 0.000 |
| Weather[Snowy] | 10.399 | 1.125 | 9.244 | 0.000 |
| Weather[Windy] | 2.123 | 1.129 | 1.880 | 0.060 |
| Traffic Level[Low] | -12.292 | 0.869 | -14.152 | 0.000 |
| Traffic Level[Medium] | -6.115 | 0.865 | -7.072 | 0.000 |
| Time of Day[Evening] | -0.007 | 0.825 | -0.009 | 0.993 |
| Time of Day[Morning] | -0.852 | 0.816 | -1.044 | 0.297 |
| Time of Day[Night] | -1.533 | 1.237 | -1.239 | 0.216 |
| Vehicle Type[Car] | -0.264 | 0.854 | -0.309 | 0.757 |
| Vehicle Type[Scooter] | -1.387 | 0.738 | -1.878 | 0.061 |
| Preparation Time (min) | 0.947 | 0.044 | 21.568 | 0.000 |
| Courier Experience[Low] | 4.091 | 0.758 | 5.397 | 0.000 |
| Courier Experience[Medium] | 1.957 | 0.795 | 2.461 | 0.014 |
From the regression results, several observations can be made:
- Distance (km) has a strong and significant positive effect on delivery time.
- Adverse weather conditions like Foggy, Rainy, and Snowy significantly increase delivery time, compared to Clear weather.
- Traffic Level[Low] and Medium significantly reduce delivery time compared to High traffic, which is the baseline.
- Different times of the day (Morning, Evening, Night) do not show statistically significant differences (p-values > 0.05) from the baseline (Afternoon).
- Similarly, neither Car nor Scooter types show significant differences (p-values > 0.05) from the baseline vehicle (Bike), suggesting that vehicle type may not strongly influence delivery time in this model.
- Preparation Time (min) increases delivery time, as expected.
- Courier Experience[Low] and Medium increase delivery time compared to the baseline High.
R2 = results.rsquared
RSE = np.sqrt(results.scale)
print("R2:", round(R2, 3))
print("RSE:", round(RSE, 3))
R2: 0.804 RSE: 9.404
The linear regression model now explains approximately 80% of the variance in food delivery time, with a small improvement compared to the previous model (72%).
The residual standard error is now about 9 minutes, suggesting a slight deterioration in prediction accuracy, with a lower error compared to the previous model (11 minutes).
Overall, the model has improved in both explanatory power and prediction accuracy.
Observing the results, we notice that the p-values for Veicle Type and Time of Day are high.
Before removing them, we aim to better understand why vehicle type does not affect delivery time.
In real scenarios, using a car should reduce delivery time compared to using a bike.
vehicle_type_counts = Food['Vehicle_Type'].value_counts()
vehicle_type_counts.index.name = None
vehicle_type_counts.name = None
translated_column_name = variables.get('Vehicle_Type', 'Vehicle_Type')
print(f"{translated_column_name} counts:")
print(vehicle_type_counts.to_string())
Vehicle Type counts: Bike 450 Scooter 259 Car 170
delivery_vehicle_mean = Food.groupby('Vehicle_Type', observed=False)['Delivery_Time_min'].mean()
delivery_vehicle_mean.index.name = None
delivery_vehicle_mean.name = None
translated_column_name = variables.get('Delivery_Time_min', 'Delivery_Time_min')
print(f"{translated_column_name} mean:")
print(delivery_vehicle_mean.to_string())
Delivery Time (min) mean: Bike 56.473333 Car 56.841176 Scooter 54.965251
distance_vehicle_mean = Food.groupby('Vehicle_Type', observed=False)['Distance_km'].mean()
distance_vehicle_mean.index.name = None
distance_vehicle_mean.name = None
translated_column_name = variables.get('Distance_km', 'Distance_km')
print(f"{translated_column_name} mean:")
print(distance_vehicle_mean.to_string())
Distance (km) mean: Bike 10.001067 Car 10.195471 Scooter 9.931197
We obtained the following results:
- As previously mentioned, Cramer's V matrix shows that the association between Vehicle Type and other categorical variables is nearly nonexistent (maximum value 0.05).
- The distribution of the vehicles is unbalanced: 450 deliveries by bike, 259 by scooter, and 170 by car, despite this, each group has sufficient size to estimate means robustly.
- The analysis of average delivery times per vehicle type revealed similar values: 56.47 minutes for bike, 54.96 for scooter, and 56.84 for car.
- Similarly, the average distance traveled by each vehicle type was nearly the same (around 10 km) and this suggests that vehicle type is not associated with relevant differences in either distance or average time.
These results justify the high p-value and the absence of effect may indicate that logistic differences between vehicles are offset by other operational factors not included in the dataset (e.g., city routes, assigned paths, traffic, etc.).
We now return to the model and remove, as previously anticipated, Vehicle Type and Time of Day
categorical = ['Weather', 'Traffic_Level', 'Courier_Experience']
Food[categorical] = Food[categorical].astype('category')
model = MS([
'Distance_km',
'Weather',
'Traffic_Level',
'Preparation_Time_min',
'Courier_Experience',
])
X = model.fit_transform(Food)
X.columns = [variables.get(col, col) for col in X.columns]
for i, col in enumerate(X.columns):
if '[' in col and ']' in col:
base_col = col.split('[')[0]
if base_col in variables:
X.columns.values[i] = col.replace(base_col, variables[base_col])
y = Food['Delivery_Time_min']
model = sm.OLS(y, X)
results = model.fit()
summarize(results)
| Coef. | Std.Err. | t | P>|t| | |
|---|---|---|---|---|
| intercept | 13.136 | 1.280 | 10.262 | 0.000 |
| Distance (km) | 2.912 | 0.056 | 51.845 | 0.000 |
| Weather[Foggy] | 8.107 | 1.057 | 7.667 | 0.000 |
| Weather[Rainy] | 5.058 | 0.836 | 6.051 | 0.000 |
| Weather[Snowy] | 10.565 | 1.120 | 9.437 | 0.000 |
| Weather[Windy] | 2.187 | 1.126 | 1.942 | 0.052 |
| Traffic Level[Low] | -12.235 | 0.868 | -14.096 | 0.000 |
| Traffic Level[Medium] | -6.171 | 0.865 | -7.137 | 0.000 |
| Preparation Time (min) | 0.947 | 0.044 | 21.556 | 0.000 |
| Courier Experience[Low] | 4.047 | 0.758 | 5.338 | 0.000 |
| Courier Experience[Medium] | 1.979 | 0.794 | 2.494 | 0.013 |
R2 = results.rsquared
RSE = np.sqrt(results.scale)
print("R2:", round(R2, 3))
print("RSE:", round(RSE, 3))
R2: 0.803 RSE: 9.41
By removing the two variables, we obtained a simpler and more interpretable model.
This was achieved without losing the good performance of the previous full model: R² and RSE values remain nearly identical to before.
y_pred = results.fittedvalues
plt.figure(figsize=(6, 4))
plt.scatter(y, y_pred, alpha=0.6, color='royalblue')
plt.plot([y.min(), y.max()], [y.min(), y.max()], color='red', linestyle='--')
plt.xlabel('Observed Values')
plt.ylabel('Predicted Values')
plt.title('Observed Values vs Predicted Values')
plt.grid(True)
plt.show()
residuals = results.resid
plt.figure(figsize=(6, 4))
plt.scatter(y_pred, residuals, alpha=0.6, color='darkorange')
plt.axhline(0, color='black', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted Values')
plt.grid(True)
plt.show()
The model is good, but from the second plot, we can observe a clear cone-shaped pattern, with residuals starting from 0 and ranging between -20 and +20.
This pattern indicates heteroscedasticity, meaning the variance of the residuals increases as the predicted values grow.
In other words, the model's error is not constant across all levels of prediction, but rather increases as the predicted values move further from 0.
A solution to this phenomenon is to apply logarithmic, square root or Box-Cox transformations to the model.
These transformations stabilize the variance by compressing the scale of larger predicted values, making the residuals more homoscedastic and improving the model's assumptions.
def transform_y(y, method):
if method == 'log':
return np.log1p(y)
elif method == 'sqrt':
return np.sqrt(y)
elif method == 'boxcox':
if any(y <= 0):
raise ValueError("Box-Cox transformation requires y to be strictly positive.")
y_transformed, lmbda = boxcox(y)
return y_transformed, lmbda
else:
raise ValueError(f"Unknown transformation: {method}")
transformations = ['log', 'sqrt', 'boxcox']
results_summary = {}
categorical = ['Weather', 'Traffic_Level', 'Courier_Experience']
Food[categorical] = Food[categorical].astype('category')
model = MS([
'Distance_km',
'Weather',
'Traffic_Level',
'Preparation_Time_min',
'Courier_Experience',
])
X = model.fit_transform(Food)
y = Food['Delivery_Time_min']
for transform in transformations:
print(f"\nTransformation: {transform}")
if transform == 'boxcox':
y_transformed, lmbda = transform_y(y, transform)
else:
y_transformed = transform_y(y, transform)
mod = sm.OLS(y_transformed, X)
res = mod.fit()
y_pred = res.predict(X)
if transform == 'log':
y_pred_plot = np.expm1(y_pred)
elif transform == 'sqrt':
y_pred_plot = np.square(y_pred)
elif transform == 'boxcox':
if lmbda == 0:
y_pred_plot = np.exp(y_pred)
else:
y_pred_plot = np.power(y_pred * lmbda + 1, 1 / lmbda)
else:
y_pred_plot = y_pred
mse = mean_squared_error(y, y_pred_plot)
rmse = np.sqrt(mse)
r2 = res.rsquared
results_summary[transform] = {'R2': r2, 'MSE': mse, 'RMSE': rmse}
plt.figure(figsize=(6, 4))
plt.scatter(y, y_pred_plot, alpha=0.5, color='royalblue')
plt.plot([y.min(), y.max()], [y.min(), y.max()], color='red', linestyle='--')
plt.xlabel('Observed Values')
plt.ylabel('Predicted Values')
plt.title(f'Observed vs Predicted - {transform}')
plt.grid(True)
plt.show()
residuals = y_transformed - y_pred
plt.figure(figsize=(6, 4))
plt.scatter(y_pred, residuals, color='darkorange', alpha=0.5)
plt.axhline(y=0, color='k', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title(f'Residuals vs Predicted - {transform}')
plt.grid(True)
plt.show()
print("\nMetrics for each transformation:")
for key, metrics in results_summary.items():
print(f"{key}: R² = {metrics['R2']:.3f}, MSE = {metrics['MSE']:.3f}, RMSE = {metrics['RMSE']:.3f}")
Transformation: log
Transformation: sqrt
Transformation: boxcox
Metrics for each transformation: log: R² = 0.823, MSE = 116.553, RMSE = 10.796 sqrt: R² = 0.827, MSE = 94.002, RMSE = 9.695 boxcox: R² = 0.825, MSE = 92.151, RMSE = 9.600
Observing the results from the plots, no clear improvement is noticeable.
Moreover, the R² values are similar to the original model.
Therefore, we prefer to continue our work with the non-transformed model.
Cross-Validation¶
Cross-validation helps evaluate how well the model generalizes to unseen data.
By testing the model on a separate validation set, we can detect overfitting and assess real-world performance.
In this case, we split the dataset into a training set Food train and a validation set Food valid.
We explicitly define the size of the validation set (440 samples), which is exactly half of the total dataset, to ensure a consistent and controlled evaluation.
Food_train , Food_valid = train_test_split(Food, test_size=440, random_state=0)
categorical = ['Weather', 'Traffic_Level', 'Courier_Experience']
Food_train[categorical] = Food_train[categorical].astype('category')
cv_model = MS([
'Distance_km',
'Weather',
'Traffic_Level',
'Preparation_Time_min',
'Courier_Experience',
])
X_train = cv_model.fit_transform(Food_train)
y_train = Food_train['Delivery_Time_min']
model = sm.OLS(y_train, X_train)
results = model.fit()
X_valid = cv_model.fit_transform(Food_valid)
y_valid = Food_valid['Delivery_Time_min']
valid_pred = results.predict(X_valid)
MSE = np.mean((y_valid - valid_pred)**2)
RMSE = np.sqrt(MSE)
print("MSE:", f"{MSE:.3f}")
print("RMSE:", f"{RMSE:.3f}")
MSE: 96.075 RMSE: 9.802
Leave-one-out cross-validation¶
categorical = ['Weather', 'Traffic_Level', 'Courier_Experience']
Food[categorical] = Food[categorical].astype('category')
loocv_model = sklearn_sm(sm.OLS, MS([
'Distance_km',
'Weather',
'Traffic_Level',
'Preparation_Time_min',
'Courier_Experience',
]))
X = Food.drop(columns=['Delivery_Time_min'])
y = Food['Delivery_Time_min']
cv_results = cross_validate(loocv_model, X, y, cv=Food.shape[0])
MSE = np.mean(cv_results['test_score'])
RMSE = np.sqrt(MSE)
print("MSE:", f"{MSE:.3f}")
print("RMSE:", f"{RMSE:.3f}")
MSE: 89.766 RMSE: 9.474
Leave-one-out cross-validation is more precise but the most expensive in terms of both time and computation.
This is because it requires the model to be trained multiple times, with each iteration leaving out a single data point for validation.
Indeed, the results obtained in terms of MSE and RMSE were slightly improved compared to the previous cross-validation, but it took approximately 30 seconds, compared to 0 seconds for the previous method.
K-fold cross-validation¶
categorical = ['Weather', 'Traffic_Level', 'Courier_Experience']
Food[categorical] = Food[categorical].astype('category')
kf_model = sklearn_sm(sm.OLS, MS([
'Distance_km',
'Weather',
'Traffic_Level',
'Preparation_Time_min',
'Courier_Experience'
]))
X = Food.drop(columns=['Delivery_Time_min'])
y = Food['Delivery_Time_min']
kf = KFold(n_splits=10, shuffle=True, random_state=1)
cv_results = cross_validate(kf_model, X, y, cv=kf)
MSE = np.mean(cv_results['test_score'])
RMSE = np.sqrt(MSE)
print("MSE:", f"{MSE:.3f}")
print("RMSE:", f"{RMSE:.3f}")
MSE: 89.613 RMSE: 9.466
K-fold generally strikes a balance between precision and computational cost, but in our case it actually performs better.
This is probably due to the presence of outliers, which make LOOCV less accurate and K-fold more robust and general.
Moreover, the execution time is very close to that of the first cross-validation method, so for our analysis, K-fold is better from every point of view.
Shrinkage Methods¶
We continue the analysis of the model using shrinkage methods.
These methods allow setting the coefficients of variables to zero.
To do this, it makes sense to return to a model with all variables.
This means adding back the categorical variables Time of Day and Vehicle Type.
It is also important to normalize the model, as this ensures all variables are on the same scale.
categorical = ['Weather', 'Traffic_Level', 'Time_of_Day', 'Vehicle_Type', 'Courier_Experience']
Food[categorical] = Food[categorical].astype('category')
model = MS([
'Distance_km',
'Weather',
'Traffic_Level',
'Time_of_Day',
'Vehicle_Type',
'Preparation_Time_min',
'Courier_Experience'
])
X = model.fit_transform(Food)
y = Food['Delivery_Time_min']
numerical_columns = ['Distance_km', 'Preparation_Time_min']
scaler = StandardScaler()
X[numerical_columns] = scaler.fit_transform(X[numerical_columns])
X.columns = [variables.get(col, col) for col in X.columns]
for i, col in enumerate(X.columns):
if '[' in col and ']' in col:
base_col = col.split('[')[0]
if base_col in variables:
X.columns.values[i] = col.replace(base_col, variables[base_col])
model = sm.OLS(y, X)
results = model.fit()
summarize(results)
| Coef. | Std.Err. | t | P>|t| | |
|---|---|---|---|---|
| intercept | 59.308 | 1.040 | 57.046 | 0.000 |
| Distance (km) | 16.526 | 0.319 | 51.796 | 0.000 |
| Weather[Foggy] | 8.116 | 1.057 | 7.676 | 0.000 |
| Weather[Rainy] | 4.996 | 0.837 | 5.967 | 0.000 |
| Weather[Snowy] | 10.399 | 1.125 | 9.244 | 0.000 |
| Weather[Windy] | 2.123 | 1.129 | 1.880 | 0.060 |
| Traffic Level[Low] | -12.292 | 0.869 | -14.152 | 0.000 |
| Traffic Level[Medium] | -6.115 | 0.865 | -7.072 | 0.000 |
| Time of Day[Evening] | -0.007 | 0.825 | -0.009 | 0.993 |
| Time of Day[Morning] | -0.852 | 0.816 | -1.044 | 0.297 |
| Time of Day[Night] | -1.533 | 1.237 | -1.239 | 0.216 |
| Vehicle Type[Car] | -0.264 | 0.854 | -0.309 | 0.757 |
| Vehicle Type[Scooter] | -1.387 | 0.738 | -1.878 | 0.061 |
| Preparation Time (min) | 6.878 | 0.319 | 21.568 | 0.000 |
| Courier Experience[Low] | 4.091 | 0.758 | 5.397 | 0.000 |
| Courier Experience[Medium] | 1.957 | 0.795 | 2.461 | 0.014 |
Ridge regression¶
We begin by splitting the database into train and test sets.
Next, we search for the optimal lambda for ridge regression.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
lambda_values = np.logspace(-2, 4, 300)
pipeline = make_pipeline(StandardScaler(), Ridge())
param_grid = {'ridge__alpha': lambda_values}
ridge_cv = GridSearchCV(pipeline, param_grid, scoring='neg_mean_squared_error', cv=10)
ridge_cv.fit(X_train, y_train)
best_alpha = ridge_cv.best_params_['ridge__alpha']
print(f"Best Ridge lambda: {best_alpha:.3f}")
Best Ridge lambda: 5.359
We evaluate the test set using the lambda found.
ridge_best = ridge_cv.best_estimator_
ridge_pred = ridge_best.predict(X_test)
ridge_mse = mean_squared_error(y_test, ridge_pred)
coefficients = np.round(ridge_best.named_steps['ridge'].coef_, 3)
coeff_df = pd.DataFrame({'Coefficient': coefficients}, index=X_train.columns)
def style_coeff(val):
styles = []
if val >= 0:
styles.append('color: green')
else:
styles.append('color: red')
styles.append('font-weight: bold')
return '; '.join(styles)
row_styles = [
{'selector': f'tbody tr:nth-child({i+1})', 'props': [('opacity', '0.35')]}
for i, val in enumerate(coeff_df['Coefficient'])
if abs(val) < 0.1
]
coeff_df_styled = (
coeff_df.style
.format({'Coefficient': '{:.3f}'})
.map(style_coeff, subset=['Coefficient'])
.set_table_styles(
[{'selector': 'th.row_heading', 'props': [('font-weight', 'bold')]}] + row_styles
)
)
coeff_df_styled
| Coefficient | |
|---|---|
| intercept | 0.000 |
| Distance (km) | 16.060 |
| Weather[Foggy] | 2.008 |
| Weather[Rainy] | 1.264 |
| Weather[Snowy] | 3.575 |
| Weather[Windy] | 0.717 |
| Traffic Level[Low] | -5.655 |
| Traffic Level[Medium] | -2.404 |
| Time of Day[Evening] | -0.067 |
| Time of Day[Morning] | -0.031 |
| Time of Day[Night] | -0.697 |
| Vehicle Type[Car] | 0.431 |
| Vehicle Type[Scooter] | -0.162 |
| Preparation Time (min) | 6.807 |
| Courier Experience[Low] | 1.409 |
| Courier Experience[Medium] | 0.116 |
From the results, we see that the less significant variables (also observed in previous models and methods) have coefficients very close to zero for ridge regression.
This occurs because ridge regression shrinks coefficients of less relevant variables toward zero, improving model generalization.
We also notice that the intercept is exactly zero.
MSE = ridge_mse
RMSE = np.sqrt(MSE)
print("MSE:", f"{MSE:.3f}")
print("RMSE:", f"{RMSE:.3f}")
MSE: 96.687 RMSE: 9.833
The results in this case are similar to those obtained previously with the three cross-validation methods (50%, LOOCV, and KF), with a slight improvement of about 2% referring to the RMSE.
coefs = []
feature_names = X_train.columns
for a in lambda_values:
pipe = make_pipeline(StandardScaler(), Ridge(alpha=a))
pipe.fit(X_train, y_train)
coefs.append(pipe.named_steps['ridge'].coef_)
coefs = np.array(coefs)
fig, ax = plt.subplots(figsize=(14, 8))
colors = plt.cm.tab20(np.linspace(0, 1, len(feature_names)))
for idx, name in enumerate(feature_names):
ax.plot(lambda_values, coefs[:, idx], label=name, color=colors[idx])
ax.set_xscale('log')
plt.xlabel('Lambda', fontsize=15)
plt.axvline(best_alpha, color='r', linestyle='--')
plt.ylabel('Coefficients', fontsize=15)
plt.title('Ridge Coefficients as Function of Regularization', fontsize=20)
plt.ylim(-10, 20)
plt.xlim(1e-1, 1e4)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.tight_layout()
handles, labels = ax.get_legend_handles_labels()
plt.show()
fig_legend = plt.figure(figsize=(12, 2))
ax_leg = fig_legend.add_subplot(111)
ax_leg.axis("off")
ax_leg.legend(handles=handles, labels=labels,
loc='center', ncol=4, fontsize=15)
fig_legend.tight_layout()
plt.show()
The results of the coefficients can also be seen in this graph, at the point corresponding to the red dashed line, which indicates the optimal lambda.
It is noticeable that the important variables are those farthest from zero.
It is also noticeable that ridge regression drives the coefficients toward zero as lambda increases.
mean_test_scores = -ridge_cv.cv_results_['mean_test_score']
alphas = ridge_cv.param_grid['ridge__alpha']
plt.figure(figsize=(10, 6))
plt.semilogx(alphas, mean_test_scores)
plt.axvline(best_alpha, color='r', linestyle='--')
plt.xlabel('Lambda', fontsize=15)
plt.ylabel('Mean Squared Error', fontsize=15)
plt.title('Ridge Cross-Validation Error', fontsize=20)
plt.ylim(87, 88)
plt.xlim(1e-1, 1e2)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.show()
In this other plot, we can see that the minimum MSE is found at the optimal lambda.
Lasso regression¶
We will now proceed in the same manner with lasso regression.
lasso_pipe = make_pipeline(StandardScaler(), Lasso(max_iter=10000))
lasso_cv = GridSearchCV(lasso_pipe,
{'lasso__alpha': lambda_values},
scoring='neg_mean_squared_error',
cv=10)
lasso_cv.fit(X_train, y_train)
best_alpha = lasso_cv.best_params_['lasso__alpha']
print(f"Best Lasso lambda: {best_alpha:.3f}")
Best Lasso lambda: 0.146
lasso_best = lasso_cv.best_estimator_
lasso_pred = lasso_best.predict(X_test)
lasso_mse = mean_squared_error(y_test, lasso_pred)
print(f"Number of initial coefficients: {len(feature_names)}")
final_coefs = np.round(lasso_best.named_steps['lasso'].coef_, 3)
print(f"Number of non-zero coefficients: {np.sum(final_coefs != 0)}")
coeff_df = pd.DataFrame({'Coefficient': final_coefs}, index=feature_names)
def style_coeff(val):
styles = []
if val == 0:
styles.append('color: gray')
elif val > 0:
styles.append('color: green')
else:
styles.append('color: red')
styles.append('font-weight: bold')
return '; '.join(styles)
row_styles = [
{'selector': f'tbody tr:nth-child({i+1})', 'props': [('opacity', '0.35')]}
for i, val in enumerate(coeff_df['Coefficient'])
if val == 0.000
]
coeff_df_styled = (
coeff_df.style
.format({'Coefficient': '{:.3f}'})
.map(style_coeff, subset=['Coefficient'])
.set_table_styles(
[{'selector': 'th.row_heading', 'props': [('font-weight', 'bold')]}] + row_styles
)
)
coeff_df_styled
Number of initial coefficients: 16 Number of non-zero coefficients: 12
| Coefficient | |
|---|---|
| intercept | 0.000 |
| Distance (km) | 16.156 |
| Weather[Foggy] | 1.794 |
| Weather[Rainy] | 1.020 |
| Weather[Snowy] | 3.374 |
| Weather[Windy] | 0.534 |
| Traffic Level[Low] | -5.356 |
| Traffic Level[Medium] | -2.135 |
| Time of Day[Evening] | -0.000 |
| Time of Day[Morning] | 0.000 |
| Time of Day[Night] | -0.524 |
| Vehicle Type[Car] | 0.334 |
| Vehicle Type[Scooter] | -0.065 |
| Preparation Time (min) | 6.738 |
| Courier Experience[Low] | 1.198 |
| Courier Experience[Medium] | 0.000 |
With lasso regression, the less significant variables are driven exactly to zero, allowing for model simplification.
Similarly, the intercept remains zero in this case as well.
MSE = lasso_mse
RMSE = np.sqrt(MSE)
print("MSE:", f"{MSE:.3f}")
print("RMSE:", f"{RMSE:.3f}")
MSE: 97.586 RMSE: 9.879
The results compared to ridge regression are practically identical.
coefs = []
feature_names = X_train.columns
for a in lambda_values:
pipe = make_pipeline(StandardScaler(), Lasso(alpha=a, max_iter=10000))
pipe.fit(X_train, y_train)
coefs.append(pipe.named_steps['lasso'].coef_)
coefs = np.array(coefs)
fig, ax = plt.subplots(figsize=(14, 8))
colors = plt.cm.tab20(np.linspace(0, 1, len(feature_names)))
for idx, name in enumerate(feature_names):
ax.plot(lambda_values, coefs[:, idx], label=name, color=colors[idx])
ax.set_xscale('log')
plt.xlabel('Lambda', fontsize=15)
plt.axvline(best_alpha, color='r', linestyle='--')
plt.ylabel('Coefficients', fontsize=15)
plt.title('Lasso Coefficients as Function of Regularization', fontsize=20)
plt.ylim(-10, 20)
plt.xlim(1e-2, 1e2)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.tight_layout()
handles, labels = ax.get_legend_handles_labels()
plt.show()
fig_legend = plt.figure(figsize=(12, 2))
ax_leg = fig_legend.add_subplot(111)
ax_leg.axis("off")
ax_leg.legend(handles=handles, labels=labels,
loc='center', ncol=4, fontsize=15)
fig_legend.tight_layout()
plt.show()
Unlike the same plot for ridge regression, we notice that in lasso regression, the coefficients go exactly to zero as the optimal lambda increases.
mean_test_scores = -lasso_cv.cv_results_['mean_test_score']
alphas = lasso_cv.param_grid['lasso__alpha']
plt.figure(figsize=(10, 6))
plt.semilogx(alphas, mean_test_scores)
plt.axvline(best_alpha, color='r', linestyle='--')
plt.xlabel('Lambda', fontsize=15)
plt.ylabel('Mean Squared Error', fontsize=15)
plt.title('Lasso Cross-Validation Error', fontsize=20)
plt.ylim(86, 89)
plt.xlim(1e-2, 1e1)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.show()
As with the optimal lambda for ridge regression, the optimal lambda for lasso regression is also at the minimum of the MSE function.
Regression Trees¶
Regression trees are decision tree models that split the data into regions based on input variables, aiming to predict the output by minimizing variability within each region.
We return to a non-normalized model, as normalization is not needed because tree-based methods are not sensitive to the scale of the variables.
As with previous methods, we split the data into train and test sets and use a DecisionTreeRegressor.
categorical = ['Weather', 'Traffic_Level', 'Time_of_Day', 'Vehicle_Type', 'Courier_Experience']
Food[categorical] = Food[categorical].astype('category')
model = MS([
'Distance_km',
'Weather',
'Traffic_Level',
'Time_of_Day',
'Vehicle_Type',
'Preparation_Time_min',
'Courier_Experience'
])
X = model.fit_transform(Food)
y = Food['Delivery_Time_min']
X.columns = [variables.get(col, col) for col in X.columns]
for i, col in enumerate(X.columns):
if '[' in col and ']' in col:
base_col = col.split('[')[0]
if base_col in variables:
X.columns.values[i] = col.replace(base_col, variables[base_col])
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.5, random_state=1
)
tree_model = DecisionTreeRegressor(random_state=2)
tree_model.fit(X_train, y_train)
DecisionTreeRegressor(random_state=2)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeRegressor(random_state=2)
feature_importances = tree_model.feature_importances_
feature_names = tree_model.feature_names_in_
features_with_importances = list(zip(feature_names, feature_importances))
print("Regression Tree:")
print(f"Tree depth: {tree_model.get_depth()}")
print(f"Number of leaves: {tree_model.get_n_leaves()}")
importance_df = pd.DataFrame({
'Importance': np.round(feature_importances, 3)
}, index=feature_names)
def style_importance(val):
return 'font-weight: bold'
row_styles = [
{'selector': f'tbody tr:nth-child({i+1})', 'props': [('opacity', '0.35')]}
for i, val in enumerate(importance_df['Importance'])
if val < 0.05
]
importance_df_styled = (
importance_df.style
.format({'Importance': '{:.3f}'})
.map(style_importance, subset=['Importance'])
.set_table_styles(
[{'selector': 'th.row_heading', 'props': [('font-weight', 'bold')]}] + row_styles
)
)
importance_df_styled
Regression Tree: Tree depth: 16 Number of leaves: 407
| Importance | |
|---|---|
| intercept | 0.000 |
| Distance (km) | 0.706 |
| Weather[Foggy] | 0.005 |
| Weather[Rainy] | 0.006 |
| Weather[Snowy] | 0.026 |
| Weather[Windy] | 0.009 |
| Traffic Level[Low] | 0.024 |
| Traffic Level[Medium] | 0.005 |
| Time of Day[Evening] | 0.012 |
| Time of Day[Morning] | 0.006 |
| Time of Day[Night] | 0.006 |
| Vehicle Type[Car] | 0.005 |
| Vehicle Type[Scooter] | 0.005 |
| Preparation Time (min) | 0.163 |
| Courier Experience[Low] | 0.014 |
| Courier Experience[Medium] | 0.005 |
The resulting tree is very complex, with 16 levels and 407 leaves, suggesting strong overfitting.
Regarding variable importance information, the results are consistent with those observed in the previous models.
plt.figure(figsize=(50, 50))
plot_tree(tree_model,
feature_names=X.columns,
filled=True,
rounded=True,
proportion=True,
fontsize=12,
precision=2)
plt.title('Regression Tree', fontsize=16)
plt.show()
The plot confirms the enormous complexity of the tree.
y_pred = tree_model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae:.3f}")
print(f"R2: {r2:.3f}")
print(f"MSE: {mse:.3f}")
print(f"RMSE: {np.sqrt(mse):.3f}")
MAE: 10.109 R2: 0.518 MSE: 214.495 RMSE: 14.646
The residual analysis shows worse results compared to the previous models, although worse performance was expected given the enormous size of the tree.
We proceed to simplify the tree by studying the maximum depth.
max_depth_values = range(1, 100)
cv_scores = []
for depth in max_depth_values:
tree_cv = DecisionTreeRegressor(criterion='squared_error', max_depth=depth, random_state=2)
scores = cross_val_score(tree_cv, X_train, y_train, cv=10, scoring='r2')
cv_scores.append(np.mean(scores))
plt.figure(figsize=(10, 6))
plt.plot(max_depth_values, cv_scores, 'o-')
plt.xlabel('Max Depth', fontsize=15)
plt.ylabel('Cross-Validation R²', fontsize=15)
plt.title('Cross-Validation Results for Tree Pruning', fontsize=20)
plt.grid(True)
plt.xlim(0, 100)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.show()
The graph shows the effect of the decision tree depth on performance, using cross-validation and the R² score.
The results show an initial increase in R² followed by a decrease, indicating the optimization between underfitting and overfitting.
The optimal depth seems to be 3, as this is where R² is maximum.
best_depth = max_depth_values[np.argmax(cv_scores)]
print(f"Best depth from cross-validation: {best_depth}")
Best depth from cross-validation: 3
The optimal depth of 3 is confirmed, so we proceed with pruning.
pruned_tree = DecisionTreeRegressor(max_depth=best_depth, random_state=2)
pruned_tree.fit(X_train, y_train)
DecisionTreeRegressor(max_depth=3, random_state=2)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeRegressor(max_depth=3, random_state=2)
plt.figure(figsize=(30, 10))
plot_tree(pruned_tree,
feature_names=X.columns,
filled=True,
rounded=True,
proportion=True,
fontsize=13,
precision=2)
plt.title(f'Pruned Regression Tree (Max Depth = {best_depth})', fontsize=20)
plt.show()
The pruned tree is much simpler and easier to explain.
The only variables considered in the end are Distance (km) and Preparation Time (min).
Dark orange indicate more extreme predicted values (higher or lower), while lighter or white colors indicate values close to the target mean.
These do not represent categories, but rather continuous value gradations.
y_pred_pruned = pruned_tree.predict(X_test)
mae = mean_absolute_error(y_test, y_pred_pruned)
mse = mean_squared_error(y_test, y_pred_pruned)
r2 = r2_score(y_test, y_pred_pruned)
print(f"MAE: {mae:.3f}")
print(f"R2: {r2:.3f}")
print(f"MSE: {mse:.3f}")
print(f"RMSE: {np.sqrt(mse):.3f}")
MAE: 9.438 R2: 0.608 MSE: 174.627 RMSE: 13.215
With pruning, the results have also improved.
Bagging¶
We now try using ensemble methods for trees, starting with Bagging.
Bagging builds multiple trees on different random subsets of the data and averages their predictions.
This technique reduces variance and helps to improve model stability and accuracy.
categorical = ['Weather', 'Traffic_Level', 'Time_of_Day', 'Vehicle_Type', 'Courier_Experience']
Food[categorical] = Food[categorical].astype('category')
model = MS([
'Distance_km',
'Weather',
'Traffic_Level',
'Time_of_Day',
'Vehicle_Type',
'Preparation_Time_min',
'Courier_Experience'
])
X = model.fit_transform(Food)
y = Food['Delivery_Time_min']
X.columns = [variables.get(col, col) for col in X.columns]
for i, col in enumerate(X.columns):
if '[' in col and ']' in col:
base_col = col.split('[')[0]
if base_col in variables:
X.columns.values[i] = col.replace(base_col, variables[base_col])
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.5, random_state=1
)
For Bagging, we use a Random Forest model with max features set equal to the total number of features.
This ensures that each tree is built using all available variables, maintaining the standard Bagging approach where randomness comes only from the data sampling, not from feature selection.
bagg_model = RandomForestRegressor(
n_estimators=500,
max_features=X.shape[1],
random_state=1,
bootstrap=True
)
bagg_model.fit(X_train, y_train)
RandomForestRegressor(max_features=16, n_estimators=500, random_state=1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestRegressor(max_features=16, n_estimators=500, random_state=1)
error_rate = []
X_test_array = X_test.to_numpy()
for i in range(1, 501, 10):
y_pred_bagg = np.mean([t.predict(X_test_array) for t in bagg_model.estimators_[:i]], axis=0)
error_rate.append(mean_squared_error(y_test, y_pred_bagg))
plt.figure(figsize=(10, 6))
plt.plot(range(1, 501, 10), error_rate, 'r-', lw=2, label='Bagging')
plt.xlabel('Number of Trees', fontsize=15)
plt.ylabel('Test MSE', fontsize=15)
plt.title('Test MSE vs Number of Trees (Bagging)', fontsize=20)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.legend()
plt.grid(True)
plt.show()
In this graph, we see that as the number of trees increases, the MSE decreases.
In particular, it stabilizes from around 100 trees onward.
y_pred_bagg = bagg_model.predict(X_test)
mse_bagg = mean_squared_error(y_test, y_pred_bagg)
print(f"Bagging MSE: {mse_bagg:.3f}")
print(f"Bagging RMSE: {np.sqrt(mse_bagg):.3f}")
Bagging MSE: 132.252 Bagging RMSE: 11.500
The residual analysis shows better results compared to previous tree-based models.
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_bagg)
plt.plot([0, 100], [0, 100], 'k--')
plt.xlabel('Actual', fontsize=15)
plt.ylabel('Predicted', fontsize=15)
plt.title('Bagging: Actual vs Predicted', fontsize=20)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.grid(True)
plt.show()
If the points lie on the bisector, the model is working well.
The model estimated with Bagging is accurate.
It performs better than a single tree but is still worse than linear models.
Some outliers are present.
importance_bagg = pd.Series(bagg_model.feature_importances_, index=X.columns)
importance_bagg.sort_values(ascending=False, inplace=True)
plt.figure(figsize=(18, 8))
importance_bagg.plot(kind='bar')
plt.title('Feature Importance (Bagging)', fontsize=20)
plt.xticks(rotation=45, fontsize=13)
plt.yticks(fontsize=13)
plt.tight_layout()
plt.show()
In this other plot, we see that the variable Distance (km) dominates, followed by Preparation Time (min), although with less than half the importance of the first.
These results were expected considering the node values of the pruned tree built earlier.
Random forest¶
Using Random Forest, which is an extension of Bagging, we introduce an additional layer of randomness by selecting a random subset of features at each split, rather than considering all features.
This helps to reduce correlation between trees, improving the model's generalization and reducing overfitting compared to Bagging, where all features are used in each tree split.
forest_model = RandomForestRegressor(
n_estimators=100,
max_features='sqrt',
random_state=1
)
forest_model.fit(X_train, y_train)
RandomForestRegressor(max_features='sqrt', random_state=1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestRegressor(max_features='sqrt', random_state=1)
error_rate_rf = []
X_test_array = X_test.to_numpy()
for i in range(1, 501, 10):
y_pred_rf = np.mean([t.predict(X_test_array) for t in forest_model.estimators_[:i]], axis=0)
error_rate_rf.append(mean_squared_error(y_test, y_pred_rf))
plt.figure(figsize=(10, 6))
plt.plot(range(1, 501, 10), error_rate_rf, 'g-', label='Random Forest')
plt.plot(range(1, 501, 10), error_rate, 'r-', lw=2, label='Bagging')
plt.xlabel('Number of Trees', fontsize=15)
plt.ylabel('Test MSE', fontsize=15)
plt.title('Test MSE vs Number of Trees', fontsize=20)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.legend(fontsize=15)
plt.grid(True)
plt.show()
In our case, as seen from the plot, Random Forest performs slightely better than Bagging.
This could be due to Random Forest's random feature selection, which slightly reduces the influence of numerical variables while slightly improving the performance of categorical variables, leading to a better balance between them compared to Bagging.
y_pred_rf = forest_model.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)
print(f"Random Forest MSE: {mse_rf:.3f}")
print(f"Random Forest RMSE: {np.sqrt(mse_rf):.3f}")
Random Forest MSE: 131.571 Random Forest RMSE: 11.470
importance_rf = pd.Series(forest_model.feature_importances_, index=X.columns)
importance_rf.sort_values(ascending=False, inplace=True)
plt.figure(figsize=(18, 8))
importance_rf.plot(kind='bar')
plt.title('Feature Importance (Random Forest)', fontsize=20)
plt.xticks(rotation=45, fontsize=13)
plt.yticks(fontsize=13)
plt.tight_layout()
plt.show()
Our previous statement is confirmed by the graph.
Boosting¶
As the last tree-based method, we use Boosting.
Boosting is an ensemble technique that builds multiple trees sequentially, where each tree corrects the errors of the previous one.
boost_model = GradientBoostingRegressor(
n_estimators=5000,
max_depth=4,
learning_rate=0.001,
loss='squared_error',
random_state=1
)
boost_model.fit(X_train, y_train)
GradientBoostingRegressor(learning_rate=0.001, max_depth=4, n_estimators=5000,
random_state=1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GradientBoostingRegressor(learning_rate=0.001, max_depth=4, n_estimators=5000,
random_state=1)y_pred_boost = boost_model.predict(X_test)
mse_boost = mean_squared_error(y_test, y_pred_boost)
print(f"Boosting MSE: {mse_boost:.3f}")
print(f"Boosting RMSE: {np.sqrt(mse_boost):.3f}")
methods = ['Bagging', 'Random Forest', 'Boosting']
mse_values = [mse_bagg, mse_rf, mse_boost]
Boosting MSE: 122.121 Boosting RMSE: 11.051
Based on the results, it appears that Boosting is the most effective tree-based method.
However, it still underperforms compared to the initial linear regression models, which remain the most accurate for this dataset.
plt.figure(figsize=(10, 6))
plt.bar(methods, mse_values)
plt.ylabel('Test MSE', fontsize=15)
plt.title('Comparison of Tree-Based Methods', fontsize=20)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
for i, v in enumerate(mse_values):
plt.text(i, v + 0.1, f"{v:.3f}", ha='center', fontsize=15)
plt.tight_layout()
plt.show()
The graph compares the performance of three ensemble methods: Bagging, Random Forest, and Boosting. As shown, Boosting achieving slightly better results.